grpc_service.proto

service InferenceService

Inference Server GRPC endpoints.

rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse)

Check liveness of the inference server.

rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse)

Check readiness of the inference server.

rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse)

Check readiness of a model in the inference server.

rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse)

Get server metadata.

rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse)

Get model metadata.

rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse)

Perform inference using a specific model.

rpc ModelStreamInfer(stream ModelInferRequest) returns (stream ModelStreamInferResponse)

Perform streaming inference.

rpc ModelConfig(ModelConfigRequest) returns (ModelConfigResponse)

Get model configuration.

rpc ModelStatistics(ModelStatisticsRequest) returns (ModelStatisticsResponse)

Get the cumulative inference statistics for a model.

rpc RepositoryIndex(RepositoryIndexRequest) returns (RepositoryIndexResponse)

Get the index of model repository contents.

rpc RepositoryModelLoad(RepositoryModelLoadRequest) returns (RepositoryModelLoadResponse)

Load or reload a model from a repository.

rpc RepositoryModelUnload(RepositoryModelUnloadRequest) returns (RepositoryModelUnloadResponse)

Unload a model.

rpc SystemSharedMemoryStatus(SystemSharedMemoryStatusRequest) returns (SystemSharedMemoryStatusResponse)

Get the status of all registered system-shared-memory regions.

rpc SystemSharedMemoryRegister(SystemSharedMemoryRegisterRequest) returns (SystemSharedMemoryRegisterResponse)

Register a system-shared-memory region.

rpc SystemSharedMemoryUnregister(SystemSharedMemoryUnregisterRequest) returns (SystemSharedMemoryUnregisterResponse)

Unregister a system-shared-memory region.

rpc CudaSharedMemoryStatus(CudaSharedMemoryStatusRequest) returns (CudaSharedMemoryStatusResponse)

Get the status of all registered CUDA-shared-memory regions.

rpc CudaSharedMemoryRegister(CudaSharedMemoryRegisterRequest) returns (CudaSharedMemoryRegisterResponse)

Register a CUDA-shared-memory region.

rpc CudaSharedMemoryUnregister(CudaSharedMemoryUnregisterRequest) returns (CudaSharedMemoryUnregisterResponse)

Unregister a CUDA-shared-memory region.
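
As a concrete illustration, the following Python sketch calls the two health RPCs. It assumes stubs generated from grpc_service.proto with protoc (producing modules grpc_service_pb2 and grpc_service_pb2_grpc) and a server GRPC endpoint on localhost:8001; both are assumptions, not fixed by this reference.

    import grpc
    import grpc_service_pb2
    import grpc_service_pb2_grpc

    # Channel to the server's GRPC endpoint (localhost:8001 is an assumption).
    channel = grpc.insecure_channel("localhost:8001")
    stub = grpc_service_pb2_grpc.InferenceServiceStub(channel)

    # Liveness and readiness are plain request/response calls with
    # empty request messages.
    print("live: ", stub.ServerLive(grpc_service_pb2.ServerLiveRequest()).live)
    print("ready:", stub.ServerReady(grpc_service_pb2.ServerReadyRequest()).ready)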

message ServerLiveRequest

Request message for ServerLive.

message ServerLiveResponse

Response message for ServerLive.

bool live

True if the inference server is live, false if not live.

message ServerReadyRequest

Request message for ServerReady.

message ServerReadyResponse

Response message for ServerReady.

bool ready

True if the inference server is ready, false if not ready.

message ModelReadyRequest

Request message for ModelReady.

string name

The name of the model to check for readiness.

string version

The version of the model to check for readiness. If not given the server will choose a version based on the model and internal policy.

message ModelReadyResponse

Response message for ModelReady.

bool ready

True if the model is ready, false if not ready.

message ServerMetadataRequest

Request message for ServerMetadata.

message ServerMetadataResponse

Response message for ServerMetadata.

string name

The server name.

string version

The server version.

string extensions(repeated)

The extensions supported by the server.

message ModelMetadataRequest

Request message for ModelMetadata.

string name

The name of the model.

string version

The version of the model to get metadata for. If not given the server will choose a version based on the model and internal policy.

message ModelMetadataResponse

Response message for ModelMetadata.

message TensorMetadata

Metadata for a tensor.

string name

The tensor name.

string datatype

The tensor data type.

int64 shape(repeated)

The tensor shape. A variable-size dimension is represented by a -1 value.

string name

The model name.

string versions(repeated)

The versions of the model.

string platform

The model’s platform.

TensorMetadata inputs(repeated)

The model’s inputs.

TensorMetadata outputs(repeated)

The model’s outputs.

message InferParameter

An inference parameter value.

oneof parameter_choice

The parameter value can be a string, an int64, or a boolean.

bool bool_param

A boolean parameter value.

int64 int64_param

An int64 parameter value.

string string_param

A string parameter value.
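
Because parameter_choice is a protobuf oneof, setting one member clears the others. A minimal sketch using the generated Python classes (assuming the stub modules from the liveness example above):

    import grpc_service_pb2

    param = grpc_service_pb2.InferParameter()
    param.string_param = "abc"
    param.int64_param = 42   # overwrites string_param: a oneof holds one value
    # WhichOneof reports the member currently set, or None if none is set.
    assert param.WhichOneof("parameter_choice") == "int64_param"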

message InferTensorContents

The data contained in a tensor, represented by the repeated type that matches the tensor’s data type. Protobuf oneof is not used because oneofs cannot contain repeated fields.

bool bool_contents(repeated)

Representation for BOOL data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

int32 int_contents(repeated)

Representation for INT8, INT16, and INT32 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

int64 int64_contents(repeated)

Representation for INT64 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

uint32 uint_contents(repeated)

Representation for UINT8, UINT16, and UINT32 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

uint64 uint64_contents(repeated)

Representation for UINT64 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

float fp32_contents(repeated)

Representation for FP32 data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

double fp64_contents(repeated)

Representation for FP64 data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

bytes bytes_contents(repeated)

Representation for BYTES data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.
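
To make the flattening requirement concrete, here is a short sketch that fills int_contents for a 2x3 INT32 tensor (INT8, INT16, and INT32 all share int_contents; the shape itself is carried on the enclosing tensor message, not here):

    import grpc_service_pb2

    # The 2x3 INT32 tensor [[1, 2, 3], [4, 5, 6]] is sent as its row-major
    # flattening; the shape lives on the enclosing InferInputTensor.
    contents = grpc_service_pb2.InferTensorContents()
    contents.int_contents.extend([1, 2, 3, 4, 5, 6])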

message ModelInferRequest

Request message for ModelInfer.

message InferInputTensor

An input tensor for an inference request.

string name

The tensor name.

string datatype

The tensor data type.

int64 shape(repeated)

The tensor shape.

map<string, InferParameter> parameters

Optional inference input tensor parameters.

InferTensorContents contents

The tensor contents using a data-type format. This field must not be specified if tensor contents are being specified in ModelInferRequest.raw_input_contents.

message InferRequestedOutputTensor

An output tensor requested for an inference request.

string name

The tensor name.

map<string, InferParameter> parameters

Optional requested output tensor parameters.

string model_name

The name of the model to use for inferencing.

string model_version

The version of the model to use for inference. If not given, the latest (most recent) version of the model is used.

string id

Optional identifier for the request. If specified, it will be returned in the response.

map<string, InferParameter> parameters

Optional inference parameters.

InferInputTensor inputs(repeated)

The input tensors for the inference.

InferRequestedOutputTensor outputs(repeated)

The requested output tensors for the inference. Optional; if not specified, all outputs declared in the model config will be returned.

bytes raw_input_contents(repeated)

The data contained in an input tensor can be represented in “raw” bytes form or in the repeated type that matches the tensor’s data type. Using the “raw” bytes form will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231.

To use the raw representation ‘raw_input_contents’ must be initialized with data for each tensor in the same order as ‘inputs’. For each tensor, the size of this content must match what is expected by the tensor’s shape and data type. The raw data must be the flattened, one-dimensional, row-major order of the tensor elements without any stride or padding between the elements. Note that the FP16 data type must be represented as raw content as there is no specific data type for a 16-bit float type.

If this field is specified then InferInputTensor::contents must not be specified for any input tensor.
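
Putting the pieces together, a sketch of an inference call using the raw representation; the model name "my_model" and input name "input0" are hypothetical, and the stub is created as in the liveness example:

    import numpy as np
    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    data = np.arange(12, dtype=np.float32).reshape(3, 4)

    request = grpc_service_pb2.ModelInferRequest()
    request.model_name = "my_model"          # hypothetical model
    inp = request.inputs.add()
    inp.name = "input0"                      # hypothetical input name
    inp.datatype = "FP32"
    inp.shape.extend([3, 4])
    # Raw form: one bytes entry per input, in the same order as 'inputs',
    # holding the flattened row-major elements with no padding. When raw
    # bytes are used, InferInputTensor.contents must be left unset.
    request.raw_input_contents.append(data.tobytes())

    response = stub.ModelInfer(request)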

message ModelInferResponse

Response message for ModelInfer.

message InferOutputTensor

An output tensor returned for an inference request.

string name

The tensor name.

string datatype

The tensor data type.

int64 shape(repeated)

The tensor shape.

map<string, InferParameter> parameters

Optional output tensor parameters.

InferTensorContents contents

The tensor contents using a data-type format. This field must not be specified if tensor contents are being specified in ModelInferResponse.raw_output_contents.

string model_name

The name of the model used for inference.

string model_version

The version of the model used for inference.

string id

The id of the inference request if one was specified.

map<string, InferParameter> parameters

Optional inference response parameters.

InferOutputTensor outputs(repeated)

The output tensors holding inference results.

bytes raw_output_contents(repeated)

The data contained in an output tensor can be represented in “raw” bytes form or in the repeated type that matches the tensor’s data type. Using the “raw” bytes form will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see https://github.com/grpc/grpc/issues/23231.

To use the raw representation ‘raw_output_contents’ must be initialized with data for each tensor in the same order as ‘outputs’. For each tensor, the size of this content must match what is expected by the tensor’s shape and data type. The raw data must be the flattened, one-dimensional, row-major order of the tensor elements without any stride or padding between the elements. Note that the FP16 data type must be represented as raw content as there is no specific data type for a 16-bit float type.

If this field is specified then InferOutputTensor::contents must not be specified for any output tensor.
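
Continuing the ModelInfer sketch above, raw output bytes can be decoded back into an array using the datatype and shape reported in 'outputs' (the FP32 assumption mirrors the request):

    import numpy as np

    # outputs[k] describes the bytes held in raw_output_contents[k].
    out = response.outputs[0]
    array = np.frombuffer(response.raw_output_contents[0],
                          dtype=np.float32).reshape(tuple(out.shape))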

message ModelStreamInferResponse

Response message for ModelStreamInfer.

string error_message

The message describing the error. An empty message indicates the inference was successful without errors.

ModelInferResponse infer_response

Holds the results of the request.
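
Because ModelStreamInfer is bidirectional, the generated Python stub takes an iterator of requests and yields responses as they arrive; per-request errors come back in error_message rather than tearing down the stream. A hedged sketch, reusing the hypothetical model from the ModelInfer example:

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    def requests():
        for i in range(3):
            req = grpc_service_pb2.ModelInferRequest()
            req.model_name = "my_model"      # hypothetical model
            req.id = str(i)                  # echoed back in the response
            # ... populate inputs as in the ModelInfer example ...
            yield req

    for resp in stub.ModelStreamInfer(requests()):
        if resp.error_message:
            print("request failed:", resp.error_message)
        else:
            print("request", resp.infer_response.id, "succeeded")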

message ModelConfigRequest

Request message for ModelConfig.

string name

The name of the model.

string version

The version of the model. If not given the model version is selected automatically based on the version policy.

message ModelConfigResponse

Response message for ModelConfig.

ModelConfig config

The model configuration.

message ModelStatisticsRequest

Request message for ModelStatistics.

string name

The name of the model. If not given, statistics are returned for all models.

string version

The version of the model. If not given, statistics are returned for all model versions.

message StatisticDuration

Statistic recording a cumulative duration metric.

uint64 count

Cumulative number of times this metric occurred.

uint64 total_time_ns

Total collected duration of this metric in nanoseconds.
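
A common derived quantity is the average duration, total_time_ns / count. A small helper, guarding the zero-count case:

    def average_ms(stat):
        """Average duration of a StatisticDuration in milliseconds."""
        return stat.total_time_ns / stat.count / 1e6 if stat.count else 0.0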

message InferStatistics

Inference statistics.

StatisticDuration success

Cumulative count and duration for successful inference requests.

StatisticDuration fail

Cumulative count and duration for failed inference requests.

StatisticDuration queue

The count and cumulative duration that inference requests wait in scheduling or other queues.

StatisticDuration compute_input

The count and cumulative duration to prepare input tensor data as required by the model framework / backend. For example, this duration should include the time to copy input tensor data to the GPU.

StatisticDuration compute_infer

The count and cumulative duration to execute the model.

StatisticDuration compute_output

The count and cumulative duration to extract output tensor data produced by the model framework / backend. For example, this duration should include the time to copy output tensor data from the GPU.

message InferBatchStatistics

Inference batch statistics.

uint64 batch_size

The size of the batch.

StatisticDuration compute_input

The count and cumulative duration to prepare input tensor data as required by the model framework / backend with the given batch size. For example, this duration should include the time to copy input tensor data to the GPU.

StatisticDuration compute_infer

The count and cumulative duration to execute the model with the given batch size.

StatisticDuration compute_output

The count and cumulative duration to extract output tensor data produced by the model framework / backend with the given batch size. For example, this duration should include the time to copy output tensor data from the GPU.

message ModelStatistics

Statistics for a specific model and version.

string name

The name of the model.

string version

The version of the model.

uint64 last_inference

The timestamp of the last inference request made for this model, as milliseconds since the epoch.

uint64 inference_count

The cumulative count of successful inference requests made for this model. Each inference in a batched request is counted as an individual inference. For example, if a client sends a single inference request with batch size 64, “inference_count” will be incremented by 64. Similarly, if a client sends 64 individual requests each with batch size 1, “inference_count” will be incremented by 64.

uint64 execution_count

The cumulative count of the number of successful inference executions performed for the model. When dynamic batching is enabled, a single model execution can perform inferencing for more than one inference request. For example, if a client sends 64 individual requests each with batch size 1 and the dynamic batcher batches them into a single large batch for model execution, then “execution_count” will be incremented by 1. If, on the other hand, the dynamic batcher is not enabled and each of the 64 individual requests is executed independently, then “execution_count” will be incremented by 64.

InferStatistics inference_stats

The aggregate statistics for the model/version.

InferBatchStatistics batch_stats(repeated)

The aggregate statistics for each different batch size that is executed in the model. The batch statistics indicate how many actual model executions were performed and show differences due to different batch size (for example, larger batches typically take longer to compute).
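
For example, inference_count / execution_count gives the average batch size actually achieved, and the per-batch-size durations show how compute time scales with batch size. A sketch (field names as documented above; model name hypothetical):

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    resp = stub.ModelStatistics(
        grpc_service_pb2.ModelStatisticsRequest(name="my_model"))
    for ms in resp.model_stats:
        if ms.execution_count:
            print(ms.name, ms.version, "avg batch size:",
                  ms.inference_count / ms.execution_count)
        for bs in ms.batch_stats:
            # Average model-execution time at this batch size, in ms.
            print("  batch", bs.batch_size, "avg compute ms:",
                  bs.compute_infer.total_time_ns
                  / max(bs.compute_infer.count, 1) / 1e6)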

message ModelStatisticsResponse

Response message for ModelStatistics.

ModelStatistics model_stats(repeated)

Statistics for each requested model.

message RepositoryIndexRequest

Request message for RepositoryIndex.

string repository_name

The name of the repository. If empty the index is returned for all repositories.

bool ready

If true, return only models currently ready for inferencing.

message RepositoryIndexResponse

Response message for RepositoryIndex.

message ModelIndex

Index entry for a model.

string name

The name of the model.

string version

The version of the model.

string state

The state of the model.

string reason

The reason, if any, that the model is in the given state.

ModelIndex models(repeated)

An index entry for each model.
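
A sketch listing the repository index; with ready set to true, only models currently servable would be returned:

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    # Empty repository_name means the index covers all repositories.
    index = stub.RepositoryIndex(
        grpc_service_pb2.RepositoryIndexRequest(repository_name="", ready=False))
    for m in index.models:
        print(m.name, m.version, m.state, m.reason or "-")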

message RepositoryModelLoadRequest

Request message for RepositoryModelLoad.

string repository_name

The name of the repository to load from. If empty the model is loaded from any repository.

string model_name

The name of the model to load or reload.

message RepositoryModelLoadResponse

Response message for RepositoryModelLoad.

message RepositoryModelUnloadRequest

Request message for RepositoryModelUnload.

string repository_name

The name of the repository from which the model was originally loaded. If empty the repository is not considered.

string model_name

The name of the model to unload.

message RepositoryModelUnloadResponse

Response message for RepositoryModelUnload.
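
The load/unload pair is symmetric. A hedged sketch (model name hypothetical; leaving repository_name empty means "any repository" for load, as described above):

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    # Load (or reload) the model, then unload it again. Both responses are
    # empty messages; failures surface as GRPC status errors (grpc.RpcError).
    stub.RepositoryModelLoad(
        grpc_service_pb2.RepositoryModelLoadRequest(model_name="my_model"))
    stub.RepositoryModelUnload(
        grpc_service_pb2.RepositoryModelUnloadRequest(model_name="my_model"))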

message SystemSharedMemoryStatusRequest

Request message for SystemSharedMemoryStatus.

string name

The name of the region to get status for. If empty the status is returned for all registered regions.

message SystemSharedMemoryStatusResponse

Response message for SystemSharedMemoryStatus.

message RegionStatus

Status for a shared memory region.

string name

The name for the shared memory region.

string shared_memory_key

The key of the underlying memory object that contains the shared memory region.

uint64 offset

Offset, in bytes, within the underlying memory object to the start of the shared memory region.

uint64 byte_size

Size of the shared memory region, in bytes.

map<string, RegionStatus> regions

Status for each of the registered regions, indexed by region name.

message SystemSharedMemoryRegisterRequest

Request message for SystemSharedMemoryRegister.

string name

The name of the region to register.

string shared_memory_key

The key of the underlying memory object that contains the shared memory region.

uint64 offset

Offset, in bytes, within the underlying memory object to the start of the shared memory region.

uint64 byte_size

Size of the shared memory region, in bytes.

message SystemSharedMemoryRegisterResponse

Response message for SystemSharedMemoryRegister.
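
A sketch of creating and registering a POSIX shared-memory region from Python. On Linux, multiprocessing.shared_memory exposes the region under /dev/shm, so the key passed to the server is the name with a leading slash; the region name and size here are hypothetical:

    from multiprocessing import shared_memory
    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    # Create a 4 KiB POSIX shared-memory object named "input_region".
    shm = shared_memory.SharedMemory(create=True, size=4096, name="input_region")

    stub.SystemSharedMemoryRegister(
        grpc_service_pb2.SystemSharedMemoryRegisterRequest(
            name="input_region",                # handle used in later requests
            shared_memory_key="/input_region",  # POSIX key: leading slash
            offset=0,
            byte_size=4096))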

message SystemSharedMemoryUnregisterRequest

Request message for SystemSharedMemoryUnregister.

string name

The name of the system region to unregister. If empty all system shared-memory regions are unregistered.

message SystemSharedMemoryUnregisterResponse

Response message for SystemSharedMemoryUnregister.

message CudaSharedMemoryStatusRequest

Request message for CudaSharedMemoryStatus.

string name

The name of the region to get status for. If empty the status is returned for all registered regions.

message CudaSharedMemoryStatusResponse

Response message for CudaSharedMemoryStatus.

message RegionStatus

Status for a shared memory region.

string name

The name for the shared memory region.

uint64 device_id

The GPU device ID where the cudaIPC handle was created.

uint64 byte_size

Size of the shared memory region, in bytes.

map<string, RegionStatus> regions

Status for each of the registered regions, indexed by region name.

message CudaSharedMemoryRegisterRequest

Request message for CudaSharedMemoryRegister.

string name

The name of the region to register.

bytes raw_handle

The raw serialized cudaIPC handle.

int64 device_id

The GPU device ID on which the cudaIPC handle was created.

uint64 byte_size

Size of the shared memory block, in bytes.

message CudaSharedMemoryRegisterResponse

Response message for CudaSharedMemoryRegister.
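
Registering a CUDA region follows the same pattern, except the region is identified by a serialized CUDA IPC memory handle produced in the process that owns the GPU allocation (for example via cudaIpcGetMemHandle). The handle bytes below are a placeholder this sketch cannot produce on its own:

    import grpc, grpc_service_pb2, grpc_service_pb2_grpc

    stub = grpc_service_pb2_grpc.InferenceServiceStub(
        grpc.insecure_channel("localhost:8001"))

    # 'serialized_handle' stands in for a cudaIpcMemHandle_t serialized to
    # bytes by the process that allocated the GPU buffer.
    serialized_handle = b"..."  # placeholder, obtained elsewhere

    stub.CudaSharedMemoryRegister(
        grpc_service_pb2.CudaSharedMemoryRegisterRequest(
            name="cuda_region",            # hypothetical region name
            raw_handle=serialized_handle,
            device_id=0,
            byte_size=1 << 20))            # hypothetical 1 MiB region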

message CudaSharedMemoryUnregisterRequest

Request message for CudaSharedMemoryUnregister.

string name

The name of the CUDA region to unregister. If empty all CUDA shared-memory regions are unregistered.

message CudaSharedMemoryUnregisterResponse

Response message for CudaSharedMemoryUnregister.