grpc_service.proto

service InferenceService
Inference Server GRPC endpoints.
rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse)
Check liveness of the inference server.

rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse)
Check readiness of the inference server.

rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse)
Check readiness of a model in the inference server.

rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse)
Get server metadata.

rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse)
Get model metadata.

rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse)
Perform inference using a specific model.

rpc ModelStreamInfer(stream ModelInferRequest) returns (stream ModelStreamInferResponse)
Perform streaming inference.

rpc ModelConfig(ModelConfigRequest) returns (ModelConfigResponse)
Get model configuration.

rpc ModelStatistics(ModelStatisticsRequest) returns (ModelStatisticsResponse)
Get the cumulative inference statistics for a model.

rpc RepositoryIndex(RepositoryIndexRequest) returns (RepositoryIndexResponse)
Get the index of model repository contents.

rpc RepositoryModelLoad(RepositoryModelLoadRequest) returns (RepositoryModelLoadResponse)
Load or reload a model from a repository.

rpc RepositoryModelUnload(RepositoryModelUnloadRequest) returns (RepositoryModelUnloadResponse)
Unload a model.

rpc SystemSharedMemoryStatus(SystemSharedMemoryStatusRequest) returns (SystemSharedMemoryStatusResponse)
Get the status of all registered system-shared-memory regions.

rpc SystemSharedMemoryRegister(SystemSharedMemoryRegisterRequest) returns (SystemSharedMemoryRegisterResponse)
Register a system-shared-memory region.

rpc SystemSharedMemoryUnregister(SystemSharedMemoryUnregisterRequest) returns (SystemSharedMemoryUnregisterResponse)
Unregister a system-shared-memory region.

rpc CudaSharedMemoryStatus(CudaSharedMemoryStatusRequest) returns (CudaSharedMemoryStatusResponse)
Get the status of all registered CUDA-shared-memory regions.

rpc CudaSharedMemoryRegister(CudaSharedMemoryRegisterRequest) returns (CudaSharedMemoryRegisterResponse)
Register a CUDA-shared-memory region.

rpc CudaSharedMemoryUnregister(CudaSharedMemoryUnregisterRequest) returns (CudaSharedMemoryUnregisterResponse)
Unregister a CUDA-shared-memory region.
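A minimal client sketch for the health RPCs in Python, assuming stubs compiled from this file with grpcio-tools. The module names grpc_service_pb2/grpc_service_pb2_grpc, the model name, and the localhost:8001 address are assumptions, not part of this definition:

    import grpc
    import grpc_service_pb2
    import grpc_service_pb2_grpc

    # Open a channel to the server's gRPC endpoint (address assumed).
    channel = grpc.insecure_channel("localhost:8001")
    stub = grpc_service_pb2_grpc.InferenceServiceStub(channel)

    # The health RPCs take empty request messages.
    live = stub.ServerLive(grpc_service_pb2.ServerLiveRequest())
    ready = stub.ServerReady(grpc_service_pb2.ServerReadyRequest())
    # Empty version: the server chooses one based on its internal policy.
    model = stub.ModelReady(
        grpc_service_pb2.ModelReadyRequest(name="mymodel", version=""))
    print(live.live, ready.ready, model.ready)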
message ServerLiveRequest
Request message for ServerLive.

message ServerLiveResponse
Response message for ServerLive.

bool live
True if the inference server is live, false if not live.
message ServerReadyRequest
Request message for ServerReady.

message ServerReadyResponse
Response message for ServerReady.

bool ready
True if the inference server is ready, false if not ready.
message ModelReadyRequest
Request message for ModelReady.

string name
The name of the model to check for readiness.

string version
The version of the model to check for readiness. If not given the server will choose a version based on the model and internal policy.

message ModelReadyResponse
Response message for ModelReady.

bool ready
True if the model is ready, false if not ready.
message ServerMetadataRequest
Request message for ServerMetadata.

message ServerMetadataResponse
Response message for ServerMetadata.

string name
The server name.

string version
The server version.

string extensions (repeated)
The extensions supported by the server.
message ModelMetadataRequest
Request message for ModelMetadata.

string name
The name of the model.

string version
The version of the model to get metadata for. If not given the server will choose a version based on the model and internal policy.

message ModelMetadataResponse
Response message for ModelMetadata.

message TensorMetadata
Metadata for a tensor.

string name
The tensor name.

string datatype
The tensor data type.

int64 shape (repeated)
The tensor shape. A variable-size dimension is represented by a -1 value.

string name
The model name.

string versions (repeated)
The versions of the model.

string platform
The model’s platform.

TensorMetadata inputs (repeated)
The model’s inputs.

TensorMetadata outputs (repeated)
The model’s outputs.
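A sketch of fetching and walking this metadata, reusing the channel/stub from the liveness sketch above (model name assumed):

    import grpc_service_pb2  # stub setup as in the liveness sketch above

    meta = stub.ModelMetadata(
        grpc_service_pb2.ModelMetadataRequest(name="mymodel", version=""))
    print(meta.name, meta.platform, list(meta.versions))
    for t in meta.inputs:
        # A -1 in shape marks a variable-size dimension.
        print("input: ", t.name, t.datatype, list(t.shape))
    for t in meta.outputs:
        print("output:", t.name, t.datatype, list(t.shape))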
message InferParameter
An inference parameter value.

message InferTensorContents
The data contained in a tensor. For a given data type the tensor contents can be represented in “raw” bytes form or in the repeated type that matches the tensor’s data type. Protobuf oneof is not used because oneofs cannot contain repeated fields.

bytes raw_contents
Raw representation of the tensor contents. The size of this content must match what is expected by the tensor’s shape and data type. The raw data must be the flattened, one-dimensional, row-major order of the tensor elements without any stride or padding between the elements. Note that the FP16 data type must be represented as raw content as there is no standard support for a 16-bit float type.

bool bool_contents (repeated)
Representation for BOOL data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

int32 int_contents (repeated)
Representation for INT8, INT16, and INT32 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

int64 int64_contents (repeated)
Representation for INT64 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

uint32 uint_contents (repeated)
Representation for UINT8, UINT16, and UINT32 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

uint64 uint64_contents (repeated)
Representation for UINT64 data types. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

float fp32_contents (repeated)
Representation for FP32 data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

double fp64_contents (repeated)
Representation for FP64 data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.

bytes byte_contents (repeated)
Representation for BYTES data type. The size must match what is expected by the tensor’s shape. The contents must be the flattened, one-dimensional, row-major order of the tensor elements.
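The flattened, row-major layout required above is exactly what NumPy produces by default (C order), so both representations are easy to build. A sketch, with NumPy and the generated grpc_service_pb2 module as assumptions:

    import numpy as np
    import grpc_service_pb2

    data = np.arange(12, dtype=np.float32).reshape(3, 4)

    # Raw form: flattened, row-major bytes with no stride or padding.
    raw = grpc_service_pb2.InferTensorContents(raw_contents=data.tobytes())

    # Typed form: the repeated field matching the FP32 data type.
    typed = grpc_service_pb2.InferTensorContents(
        fp32_contents=data.reshape(-1).tolist())

    # FP16 has no typed repeated field, so it must use raw_contents.
    fp16 = grpc_service_pb2.InferTensorContents(
        raw_contents=data.astype(np.float16).tobytes())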
message ModelInferRequest
Request message for ModelInfer.

message InferInputTensor
An input tensor for an inference request.

string name
The tensor name.

string datatype
The tensor data type.

int64 shape (repeated)
The tensor shape.

map<string, InferParameter> parameters
Optional inference input tensor parameters.

InferTensorContents contents
The input tensor data.

message InferRequestedOutputTensor
An output tensor requested for an inference request.

string name
The tensor name.

map<string, InferParameter> parameters
Optional requested output tensor parameters.

string model_name
The name of the model to use for inferencing.

string model_version
The version of the model to use for inference. If not given the latest/most-recent version of the model is used.

string id
Optional identifier for the request. If specified will be returned in the response.

map<string, InferParameter> parameters
Optional inference parameters.

InferInputTensor inputs (repeated)
The input tensors for the inference.

InferRequestedOutputTensor outputs (repeated)
The requested output tensors for the inference. Optional, if not specified all outputs specified in the model config will be returned.
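Assembling a request from these pieces is mostly mechanical. A sketch for a model with one FP32 input (model name, tensor names, and shape are assumptions; stub as in the liveness sketch above):

    import numpy as np
    import grpc_service_pb2  # stub setup as in the liveness sketch above

    batch = np.random.rand(1, 16).astype(np.float32)

    request = grpc_service_pb2.ModelInferRequest(
        model_name="mymodel",   # assumed model
        model_version="",       # empty: latest/most-recent version
        id="req-0")             # optional; echoed back in the response

    inp = request.inputs.add()
    inp.name = "INPUT0"         # assumed tensor name
    inp.datatype = "FP32"
    inp.shape.extend(batch.shape)
    inp.contents.raw_contents = batch.tobytes()

    # `outputs` is optional; omitting it returns all configured outputs.
    request.outputs.add(name="OUTPUT0")

    response = stub.ModelInfer(request)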
message ModelInferResponse
Response message for ModelInfer.

message InferOutputTensor
An output tensor returned for an inference request.

string name
The tensor name.

string datatype
The tensor data type.

int64 shape (repeated)
The tensor shape.

map<string, InferParameter> parameters
Optional output tensor parameters.

InferTensorContents contents
The output tensor data.

string model_name
The name of the model used for inference.

string model_version
The version of the model used for inference.

string id
The id of the inference request if one was specified.

map<string, InferParameter> parameters
Optional inference response parameters.

InferOutputTensor outputs (repeated)
The output tensors holding inference results.
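A sketch of decoding the outputs, assuming FP32 results and the `response` from the ModelInfer sketch above; whether the raw or the typed contents field is populated depends on the server, so check both:

    import numpy as np

    for out in response.outputs:
        if out.contents.raw_contents:
            flat = np.frombuffer(out.contents.raw_contents, dtype=np.float32)
        else:
            flat = np.asarray(out.contents.fp32_contents, dtype=np.float32)
        # Restore the tensor's shape from the response metadata.
        print(out.name, out.datatype, flat.reshape(tuple(out.shape)).shape)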
message ModelStreamInferResponse
Response message for ModelStreamInfer.

string error_message
The message describing the error. The empty message indicates the inference was successful without errors.

ModelInferResponse infer_response
Holds the results of the request.
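ModelStreamInfer is a bidirectional stream: the client sends a stream of ModelInferRequest messages and receives one ModelStreamInferResponse per request. A sketch (build_request is a hypothetical helper returning a ModelInferRequest; stub as above):

    import grpc_service_pb2  # stub setup as in the liveness sketch above

    def requests():
        for i in range(3):
            yield build_request(i)  # hypothetical request builder

    for rsp in stub.ModelStreamInfer(requests()):
        if rsp.error_message:          # empty string means success
            print("error:", rsp.error_message)
        else:
            print("ok:", rsp.infer_response.id)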
message ModelConfigRequest
Request message for ModelConfig.

string name
The name of the model.

string version
The version of the model. If not given the model version is selected automatically based on the version policy.

message ModelConfigResponse
Response message for ModelConfig.

ModelConfig config
The model configuration.
message ModelStatisticsRequest
Request message for ModelStatistics.

string name
The name of the model. If not given returns statistics for all models.

string version
The version of the model. If not given returns statistics for all model versions.

message StatisticDuration
Statistic recording a cumulative duration metric.

uint64 count
Cumulative number of times this metric occurred.

uint64 total_time_ns
Total collected duration of this metric in nanoseconds.

message InferStatistics
Inference statistics.

StatisticDuration success
Cumulative count and duration for successful inference requests.

StatisticDuration fail
Cumulative count and duration for failed inference requests.

StatisticDuration queue
The count and cumulative duration that inference requests wait in scheduling or other queues.

StatisticDuration compute_input
The count and cumulative duration to prepare input tensor data as required by the model framework / backend. For example, this duration should include the time to copy input tensor data to the GPU.

StatisticDuration compute_infer
The count and cumulative duration to execute the model.

StatisticDuration compute_output
The count and cumulative duration to extract output tensor data produced by the model framework / backend. For example, this duration should include the time to copy output tensor data from the GPU.

message InferBatchStatistics
Inference batch statistics.

uint64 batch_size
The size of the batch.

StatisticDuration compute_input
The count and cumulative duration to prepare input tensor data as required by the model framework / backend with the given batch size. For example, this duration should include the time to copy input tensor data to the GPU.

StatisticDuration compute_infer
The count and cumulative duration to execute the model with the given batch size.

StatisticDuration compute_output
The count and cumulative duration to extract output tensor data produced by the model framework / backend with the given batch size. For example, this duration should include the time to copy output tensor data from the GPU.

message ModelStatistics
Statistics for a specific model and version.

string name
The name of the model.

string version
The version of the model.

uint64 last_inference
The timestamp of the last inference request made for this model, as milliseconds since the epoch.

uint64 inference_count
The cumulative count of successful inference requests made for this model. Each inference in a batched request is counted as an individual inference. For example, if a client sends a single inference request with batch size 64, “inference_count” will be incremented by 64. Similarly, if a client sends 64 individual requests each with batch size 1, “inference_count” will be incremented by 64.

uint64 execution_count
The cumulative count of the number of successful inference executions performed for the model. When dynamic batching is enabled, a single model execution can perform inferencing for more than one inference request. For example, if a client sends 64 individual requests each with batch size 1 and the dynamic batcher batches them into a single large batch for model execution, then “execution_count” will be incremented by 1. If, on the other hand, the dynamic batcher is not enabled and each of the 64 individual requests is executed independently, then “execution_count” will be incremented by 64.

InferStatistics inference_stats
The aggregate statistics for the model/version.

InferBatchStatistics batch_stats (repeated)
The aggregate statistics for each different batch size that is executed in the model. The batch statistics indicate how many actual model executions were performed and show differences due to different batch size (for example, larger batches typically take longer to compute).

message ModelStatisticsResponse
Response message for ModelStatistics.

ModelStatistics model_stats (repeated)
Statistics for each requested model.
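Because each StatisticDuration is a cumulative (count, total_time_ns) pair, average per-occurrence durations fall out by simple division. A sketch (model name assumed; stub as in the liveness sketch above):

    import grpc_service_pb2  # stub setup as in the liveness sketch above

    stats = stub.ModelStatistics(
        grpc_service_pb2.ModelStatisticsRequest(name="mymodel", version=""))

    for m in stats.model_stats:
        s = m.inference_stats
        if s.queue.count and s.compute_infer.count:
            # Average duration in microseconds = total_time_ns / count / 1000.
            print(m.name, m.version,
                  "avg queue us:",
                  s.queue.total_time_ns / s.queue.count / 1e3,
                  "avg infer us:",
                  s.compute_infer.total_time_ns / s.compute_infer.count / 1e3)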
message RepositoryIndexRequest
Request message for RepositoryIndex.

string repository_name
The name of the repository. If empty the index is returned for all repositories.

bool ready
If true return only models currently ready for inferencing.

message RepositoryIndexResponse
Response message for RepositoryIndex.

message ModelIndex
Index entry for a model.

string name
The name of the model.

string version
The version of the model.

string state
The state of the model.

string reason
The reason, if any, that the model is in the given state.

ModelIndex models (repeated)
An index entry for each model.
message RepositoryModelLoadRequest
Request message for RepositoryModelLoad.

string repository_name
The name of the repository to load from. If empty the model is loaded from any repository.

string model_name
The name of the model to load, or reload.

message RepositoryModelLoadResponse
Response message for RepositoryModelLoad.

message RepositoryModelUnloadRequest
Request message for RepositoryModelUnload.

string repository_name
The name of the repository from which the model was originally loaded. If empty the repository is not considered.

string model_name
The name of the model to unload.

message RepositoryModelUnloadResponse
Response message for RepositoryModelUnload.
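A sketch of the repository lifecycle: list the index, then load and unload a model by name (model name assumed; stub as in the liveness sketch above):

    import grpc_service_pb2  # stub setup as in the liveness sketch above

    # Index every repository; ready=True would filter to servable models.
    index = stub.RepositoryIndex(
        grpc_service_pb2.RepositoryIndexRequest(
            repository_name="", ready=False))
    for entry in index.models:
        print(entry.name, entry.version, entry.state, entry.reason)

    # Load (or reload) and later unload a model by name.
    stub.RepositoryModelLoad(
        grpc_service_pb2.RepositoryModelLoadRequest(model_name="mymodel"))
    stub.RepositoryModelUnload(
        grpc_service_pb2.RepositoryModelUnloadRequest(model_name="mymodel"))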
message SystemSharedMemoryStatusRequest
Request message for SystemSharedMemoryStatus.

string name
The name of the region to get status for. If empty the status is returned for all registered regions.

message SystemSharedMemoryStatusResponse
Response message for SystemSharedMemoryStatus.

message RegionStatus
Status for a shared memory region.

string name
The name for the shared memory region.

string key
The key of the underlying memory object that contains the shared memory region.

uint64 offset
Offset, in bytes, within the underlying memory object to the start of the shared memory region.

uint64 byte_size
Size of the shared memory region, in bytes.

map<string, RegionStatus> regions
Status for each of the registered regions, indexed by region name.

message SystemSharedMemoryRegisterRequest
Request message for SystemSharedMemoryRegister.

string name
The name of the region to register.

string key
The key of the underlying memory object that contains the shared memory region.

uint64 offset
Offset, in bytes, within the underlying memory object to the start of the shared memory region.

uint64 byte_size
Size of the shared memory region, in bytes.

message SystemSharedMemoryRegisterResponse
Response message for SystemSharedMemoryRegister.

message SystemSharedMemoryUnregisterRequest
Request message for SystemSharedMemoryUnregister.

string name
The name of the system region to unregister. If empty all system shared-memory regions are unregistered.

message SystemSharedMemoryUnregisterResponse
Response message for SystemSharedMemoryUnregister.
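A sketch of the register/status/unregister cycle. It assumes a POSIX shared-memory object "/my_shm" of at least 4096 bytes already exists (e.g. created with shm_open by the client); the region name, key, and sizes are assumptions:

    import grpc_service_pb2  # stub setup as in the liveness sketch above

    stub.SystemSharedMemoryRegister(
        grpc_service_pb2.SystemSharedMemoryRegisterRequest(
            name="input_region", key="/my_shm", offset=0, byte_size=4096))

    status = stub.SystemSharedMemoryStatus(
        grpc_service_pb2.SystemSharedMemoryStatusRequest(name="input_region"))
    print(status.regions["input_region"].byte_size)

    stub.SystemSharedMemoryUnregister(
        grpc_service_pb2.SystemSharedMemoryUnregisterRequest(
            name="input_region"))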
message CudaSharedMemoryStatusRequest
Request message for CudaSharedMemoryStatus.

string name
The name of the region to get status for. If empty the status is returned for all registered regions.

message CudaSharedMemoryStatusResponse
Response message for CudaSharedMemoryStatus.

message RegionStatus
Status for a shared memory region.

string name
The name for the shared memory region.

uint64 device_id
The GPU device ID where the cudaIPC handle was created.

uint64 byte_size
Size of the shared memory region, in bytes.

map<string, RegionStatus> regions
Status for each of the registered regions, indexed by region name.

message CudaSharedMemoryRegisterRequest
Request message for CudaSharedMemoryRegister.

string name
The name of the region to register.

bytes raw_handle
The raw serialized cudaIPC handle.

int64 device_id
The GPU device ID on which the cudaIPC handle was created.

uint64 byte_size
Size of the shared memory block, in bytes.

message CudaSharedMemoryRegisterResponse
Response message for CudaSharedMemoryRegister.

message CudaSharedMemoryUnregisterRequest
Request message for CudaSharedMemoryUnregister.

string name
The name of the cuda region to unregister. If empty all cuda shared-memory regions are unregistered.

message CudaSharedMemoryUnregisterResponse
Response message for CudaSharedMemoryUnregister.
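A sketch of registering a GPU buffer, using CuPy to obtain the serialized cudaIPC handle (CuPy is an assumption; any CUDA binding that exposes cudaIpcGetMemHandle works, and the region name and device ID are assumptions):

    import cupy
    import grpc_service_pb2  # stub setup as in the liveness sketch above

    buf = cupy.zeros(1024, dtype=cupy.float32)  # device buffer to share
    # Serialize a cudaIPC handle for the buffer's device pointer.
    handle = cupy.cuda.runtime.ipcGetMemHandle(buf.data.ptr)

    stub.CudaSharedMemoryRegister(
        grpc_service_pb2.CudaSharedMemoryRegisterRequest(
            name="gpu_region", raw_handle=bytes(handle),
            device_id=0, byte_size=buf.nbytes))

    stub.CudaSharedMemoryUnregister(
        grpc_service_pb2.CudaSharedMemoryUnregisterRequest(name="gpu_region"))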